
Author:
| Student Name | Student Number | Email address |
|---|---|---|
| Vitor Notaro | sba20229 | vitornotaro34@gmail.com |
Lecturers:
| Lecturer Name | Module | Email address |
|---|---|---|
| David McQuaid | Data Preparation & Visualisation for Data Analytics | dmcquaid@cct.ie |
| David McQuaid | Programming for Data Analytics | dmcquaid@cct.ie |
| Dr. Muhammad Iqbal | Machine Learning for Data Analytics | miqbal@cct.ie |
| Marina Iantorno | Statistics for Data Analytics | miantorno@cct.ie |
# Importing libraries
import pandas as pd
import numpy as np
from dataprep.eda import create_report
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
# Using Pandas to read the dataset.
df = pd.read_csv("dublinbikes_20210401_20210701.csv")
# Using head() to visualize the top 5 rows
df.head(5)
| | STATION ID | TIME | LAST UPDATED | NAME | BIKE STANDS | AVAILABLE BIKE STANDS | AVAILABLE BIKES | STATUS | ADDRESS | LATITUDE | LONGITUDE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2021-04-01 00:00:03 | 2021-03-31 23:56:25 | BLESSINGTON STREET | 20 | 6 | 14 | Open | Blessington Street | 53.35677 | -6.26814 |
| 1 | 2 | 2021-04-01 00:05:03 | 2021-04-01 00:00:12 | BLESSINGTON STREET | 20 | 6 | 14 | Open | Blessington Street | 53.35677 | -6.26814 |
| 2 | 2 | 2021-04-01 00:10:03 | 2021-04-01 00:00:12 | BLESSINGTON STREET | 20 | 6 | 14 | Open | Blessington Street | 53.35677 | -6.26814 |
| 3 | 2 | 2021-04-01 00:15:02 | 2021-04-01 00:10:17 | BLESSINGTON STREET | 20 | 6 | 14 | Open | Blessington Street | 53.35677 | -6.26814 |
| 4 | 2 | 2021-04-01 00:20:03 | 2021-04-01 00:10:17 | BLESSINGTON STREET | 20 | 6 | 14 | Open | Blessington Street | 53.35677 | -6.26814 |
# Data Type of each variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2884576 entries, 0 to 2884575
Data columns (total 11 columns):
 #   Column                 Dtype
---  ------                 -----
 0   STATION ID             int64
 1   TIME                   object
 2   LAST UPDATED           object
 3   NAME                   object
 4   BIKE STANDS            int64
 5   AVAILABLE BIKE STANDS  int64
 6   AVAILABLE BIKES        int64
 7   STATUS                 object
 8   ADDRESS                object
 9   LATITUDE               float64
 10  LONGITUDE              float64
dtypes: float64(2), int64(4), object(5)
memory usage: 242.1+ MB
Central Tendency Measures
These measures summarize, in a single number, the centre of the data distribution.
Variation Measures
These measures quantify how homogeneous the observations are.
In other words, they show how similar or different the values are.
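Both families of measures can be computed directly with pandas; a minimal sketch on a small hypothetical sample:

```python
import pandas as pd

# Hypothetical sample standing in for a numeric column such as AVAILABLE BIKES
sample = pd.Series([6, 14, 14, 20, 6, 14, 16])

# Central tendency: where the centre of the distribution lies
print(sample.mean())     # arithmetic average
print(sample.median())   # middle value, robust to outliers
print(sample.mode()[0])  # most frequent value

# Variation: how spread out the observations are
print(sample.max() - sample.min())  # range
print(sample.var())                 # sample variance
print(sample.std())                 # sample standard deviation
```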
# df.describe() summarizes the data using descriptive statistics:
# central tendency and measures of variability for numerical features (except Latitude and Longitude; Station ID is just an identifier).
# Using .round() to suppress scientific notation.
df.describe(include=['int64']).round()
| | STATION ID | BIKE STANDS | AVAILABLE BIKE STANDS | AVAILABLE BIKES |
|---|---|---|---|---|
| count | 2884576.0 | 2884576.0 | 2884576.0 | 2884576.0 |
| mean | 60.0 | 32.0 | 20.0 | 12.0 |
| std | 34.0 | 8.0 | 9.0 | 7.0 |
| min | 2.0 | 16.0 | 0.0 | 0.0 |
| 25% | 31.0 | 29.0 | 13.0 | 6.0 |
| 50% | 61.0 | 30.0 | 20.0 | 11.0 |
| 75% | 90.0 | 40.0 | 27.0 | 16.0 |
| max | 117.0 | 40.0 | 40.0 | 40.0 |
# Using df.describe to get summary statistics for objects
df.describe(include=[object])
| | TIME | LAST UPDATED | NAME | STATUS | ADDRESS |
|---|---|---|---|---|---|
| count | 2884576 | 2884576 | 2884576 | 2884576 | 2884576 |
| unique | 26464 | 1311282 | 109 | 1 | 109 |
| top | 2021-04-01 00:00:03 | 2021-05-26 14:33:05 | BLESSINGTON STREET | Open | Blessington Street |
| freq | 109 | 222 | 26464 | 2884576 | 26464 |
# Converting Last_Updated to DateTime type.
df['LAST UPDATED'] = pd.to_datetime(df['LAST UPDATED'])
df['TIME'] = pd.to_datetime(df['TIME'])
# Using EDA Dataprep to create a full report about the Dataset.
create_report(df).show();
| Number of Variables | 11 |
|---|---|
| Number of Rows | 2.8846e+06 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 748.6 MB |
| Average Row Size in Memory | 272.1 B |
| Variable Types | numerical: 6, categorical: 3, datetime: 2 |
| BIKE STANDS is skewed | Skewed |
|---|---|
| NAME has a high cardinality: 109 distinct values | High Cardinality |
| ADDRESS has a high cardinality: 109 distinct values | High Cardinality |
| STATUS has constant value "Open" | Constant |
| STATUS has constant length 4 | Constant Length |
| LONGITUDE has 2884576 (100.0%) negatives | Negatives |
STATION ID (numerical)
| Approximate Distinct Count | 109 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | 60.3303 |
| Minimum | 2 |
| Maximum | 117 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 2 |
|---|---|
| 5-th Percentile | 7 |
| Q1 | 31 |
| Median | 61 |
| Q3 | 91 |
| 95-th Percentile | 112 |
| Maximum | 117 |
| Range | 115 |
| IQR | 60 |
| Mean | 60.3303 |
|---|---|
| Standard Deviation | 33.8648 |
| Variance | 1146.8271 |
| Sum | 1.7403e+08 |
| Skewness | -0.03532 |
| Kurtosis | -1.2204 |
| Coefficient of Variation | 0.5613 |
TIME (datetime)
| Distinct Count | 26401.2853 |
|---|---|
| Approximate Unique (%) | 0.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 22.0 MB |
| Minimum | 2021-04-01 00:00:03 |
| Maximum | 2021-07-01 23:55:02 |
LAST UPDATED (datetime)
| Distinct Count | 1.3091e+06 |
|---|---|
| Approximate Unique (%) | 45.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 22.0 MB |
| Minimum | 2021-03-31 23:49:35 |
| Maximum | 2021-07-01 23:54:05 |
NAME (categorical)
| Approximate Distinct Count | 109 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 224.4 MB |
| Mean | 16.5596 |
|---|---|
| Standard Deviation | 4.7708 |
| Median | 16 |
| Minimum | 9 |
| Maximum | 33 |
| 1st row | BLESSINGTON STREET |
|---|---|
| 2nd row | BLESSINGTON STREET |
| 3rd row | BLESSINGTON STREET |
| 4th row | BLESSINGTON STREET |
| 5th row | BLESSINGTON STREET |
| Count | 42792288 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 4207776 |
| Uppercase Letter | 42792288 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
BIKE STANDS (numerical)
| Approximate Distinct Count | 17 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | 32.1101 |
| Minimum | 16 |
| Maximum | 40 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 16 |
|---|---|
| 5-th Percentile | 20 |
| Q1 | 29 |
| Median | 30 |
| Q3 | 40 |
| 95-th Percentile | 40 |
| Maximum | 40 |
| Range | 24 |
| IQR | 11 |
| Mean | 32.1101 |
|---|---|
| Standard Deviation | 7.6486 |
| Variance | 58.5017 |
| Sum | 9.2624e+07 |
| Skewness | -0.4316 |
| Kurtosis | -1.1335 |
| Coefficient of Variation | 0.2382 |
AVAILABLE BIKE STANDS (numerical)
| Approximate Distinct Count | 41 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | 20.1284 |
| Minimum | 0 |
| Maximum | 40 |
| Zeros | 30786 |
| Zeros (%) | 1.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 5 |
| Q1 | 14 |
| Median | 20 |
| Q3 | 27 |
| 95-th Percentile | 36 |
| Maximum | 40 |
| Range | 40 |
| IQR | 13 |
| Mean | 20.1284 |
|---|---|
| Standard Deviation | 9.306 |
| Variance | 86.6023 |
| Sum | 5.8062e+07 |
| Skewness | 0.001502 |
| Kurtosis | -0.6708 |
| Coefficient of Variation | 0.4623 |
AVAILABLE BIKES (numerical)
| Approximate Distinct Count | 41 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | 11.8447 |
| Minimum | 0 |
| Maximum | 40 |
| Zeros | 90677 |
| Zeros (%) | 3.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 1 |
| Q1 | 6 |
| Median | 11 |
| Q3 | 16 |
| 95-th Percentile | 26 |
| Maximum | 40 |
| Range | 40 |
| IQR | 10 |
| Mean | 11.8447 |
|---|---|
| Standard Deviation | 7.3748 |
| Variance | 54.3873 |
| Sum | 3.4167e+07 |
| Skewness | 0.6326 |
| Kurtosis | 0.2036 |
| Coefficient of Variation | 0.6226 |
STATUS (categorical)
| Approximate Distinct Count | 1 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 189.8 MB |
| Mean | 4 |
|---|---|
| Standard Deviation | 0 |
| Median | 4 |
| Minimum | 4 |
| Maximum | 4 |
| 1st row | Open |
|---|---|
| 2nd row | Open |
| 3rd row | Open |
| 4th row | Open |
| 5th row | Open |
| Count | 11538304 |
|---|---|
| Lowercase Letter | 8653728 |
| Space Separator | 0 |
| Uppercase Letter | 2884576 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
ADDRESS (categorical)
| Approximate Distinct Count | 109 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 224.4 MB |
| Mean | 16.578 |
|---|---|
| Standard Deviation | 4.7628 |
| Median | 16 |
| Minimum | 9 |
| Maximum | 33 |
| 1st row | Blessington Street |
|---|---|
| 2nd row | Blessington Street |
| 3rd row | Blessington Street |
| 4th row | Blessington Street |
| 5th row | Blessington Street |
| Count | 42818752 |
|---|---|
| Lowercase Letter | 35699936 |
| Space Separator | 4207776 |
| Uppercase Letter | 7118816 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
LATITUDE (numerical)
| Approximate Distinct Count | 109 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | 53.3455 |
| Minimum | 53.3301 |
| Maximum | 53.36 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 53.3301 |
|---|---|
| 5-th Percentile | 53.3337 |
| Q1 | 53.3398 |
| Median | 53.3452 |
| Q3 | 53.3509 |
| 95-th Percentile | 53.3584 |
| Maximum | 53.36 |
| Range | 0.02988 |
| IQR | 0.01116 |
| Mean | 53.3455 |
|---|---|
| Standard Deviation | 0.007569 |
| Variance | 5.7286e-05 |
| Sum | 1.5388e+08 |
| Skewness | 0.07786 |
| Kurtosis | -0.7722 |
| Coefficient of Variation | 0.00014188 |
LONGITUDE (numerical)
| Approximate Distinct Count | 109 |
|---|---|
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 44.0 MB |
| Mean | -6.2645 |
| Minimum | -6.31 |
| Maximum | -6.2309 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 2884576 |
| Negatives (%) | 100.0% |
| Minimum | -6.31 |
|---|---|
| 5-th Percentile | -6.2978 |
| Q1 | -6.2752 |
| Median | -6.2632 |
| Q3 | -6.2509 |
| 95-th Percentile | -6.2372 |
| Maximum | -6.2309 |
| Range | 0.07916 |
| IQR | 0.02433 |
| Mean | -6.2645 |
|---|---|
| Standard Deviation | 0.01819 |
| Variance | 0.00033076 |
| Sum | -1.807e+07 |
| Skewness | -0.4446 |
| Kurtosis | -0.2628 |
| Coefficient of Variation | -0.002903 |
Dataset Insights
There is no missing data in any feature.
STATUS has the constant value "Open" (adds no value to the analysis).
STATION ID is a unique identifier for each station (adds no value to the analysis).
ADDRESS has a high cardinality: 109 distinct values (adds no value once we have Latitude and Longitude).
AVAILABLE BIKES is the only feature that presents outliers (around 1% of the values); however, no value exceeds 40, which is the maximum possible since no bike station has more than 40 stands.
I found no evidence of errors: in no observation was the number of bikes available at a station greater than its total number of stands. For that reason I consider the outliers acceptable and part of the nature of the business operation.
AVAILABLE BIKE STANDS is skewed but close to a normal distribution.
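The outlier check above typically relies on the 1.5 × IQR rule; a sketch using the quartiles of AVAILABLE BIKES from the describe() summary (Q1 = 6, Q3 = 16) and a few hypothetical availability counts:

```python
import pandas as pd

# Quartiles for AVAILABLE BIKES taken from the describe() summary
q1, q3 = 6, 16
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # fences at -9 and 31

# Hypothetical availability counts to illustrate the rule
bikes = pd.Series([0, 3, 6, 11, 16, 22, 35, 40])
outliers = bikes[(bikes < lower) | (bikes > upper)]
print(outliers.tolist())  # [35, 40]
```

Values above 31 get flagged even though, as noted, they are valid readings capped at 40.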
Since descriptive analyses helped me understand the dataset, I can also use those insights to calculate probabilities.
A probability distribution is a statistical function that shows all the possible values of a random variable in a specific range, together with their likelihoods.
The binomial probability distribution is useful to answer questions such as:
From my analysis, 27 stations (about 25% of the Dublin Bike stations) have 30 stands. If we randomly choose 5 stations:
What is the probability of finding exactly 3 stations with 30 stands?
As this is a random variable, we need a definition.
The structure of this variable is:
X = number of elements with a characteristic/attribute (within a limit)
X = number of Stations with 30 Stands (within 5 Stations)
n = 5
p = 0.25
q = 0.75
Formula: P(X = k) = C(n, k) * p^k * q^(n-k)
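As a sanity check, the binomial formula can be evaluated by hand with Python's standard library:

```python
from math import comb

n, k, p = 5, 3, 0.25
q = 1 - p

# P(X = 3) = C(5, 3) * 0.25**3 * 0.75**2
prob = comb(n, k) * p**k * q**(n - k)
print(round(prob, 2))  # 0.09
```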
Using Python we can answer this question: the answer is approximately 9%.
df.groupby('BIKE STANDS').NAME.nunique()
BIKE STANDS
16     2
20    16
21     1
22     1
23     2
24     1
25     2
27     1
29     4
30    27
31     1
32     1
33     1
35     2
36     2
38     3
40    42
Name: NAME, dtype: int64
# The probability of exactly 3, when n=5 and p= 0.25 using python
from scipy.stats import binom
binom.pmf(k = 3, n = 5, p = 0.25 ).round(2)
0.09
We can also use descriptive analyses to calculate probabilities for a normally distributed variable.
The normal distribution is continuous, so probabilities accumulate over ranges: the probability of any exact value is zero, and we always ask for greater-than or less-than, never exactly-equal-to.
The normal distribution is always symmetric, meaning the curve is never skewed to one side, and the expected value (the mean) sits in the middle of the bell curve.
The normal probability distribution is useful to answer questions such as:
Assume the number of AVAILABLE BIKE STANDS across Dublin Bike stations is normally distributed with a mean of 20 and a standard deviation of 10.
What is the probability of finding a station with 15 or more available stands?
Formula: P(X >= x) = 1 - Φ((x - μ) / σ), where Φ is the standard normal CDF
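The survival probability can also be checked with the standard library, using the identity Φ(z) = (1 + erf(z / sqrt(2))) / 2 for the standard normal CDF:

```python
from math import erf, sqrt

mu, sigma, x = 20, 10, 15
z = (x - mu) / sigma  # z-score: -0.5

# Standard normal CDF via the error function
cdf = 0.5 * (1 + erf(z / sqrt(2)))
print(round(1 - cdf, 2))  # 0.69 (survival function)
```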
Using Python we can answer this question: the answer is approximately 69%.
#Importing library
import scipy
# To find the probability that the variable has a value greater than or equal to 15, using sf (the Survival Function)
scipy.stats.norm.sf(15,20,10).round(2)
0.69
# Creating a new dataframe with one row per station
df1 = df[["NAME", "LATITUDE", "LONGITUDE"]].drop_duplicates(subset="NAME")
df1.reset_index(drop=True, inplace=True)
# Plotting the station map
import folium
# create a base map centered around Dublin
mapObj = folium.Map(location=[df1.LATITUDE.mean(), df1.LONGITUDE.mean()], zoom_start=13.5, control_scale=True)
# Storing the observation values in separate series
longs = df1["LONGITUDE"]
lats = df1["LATITUDE"]
names = df1["NAME"]
# Create a marker for every location in the df1 DataFrame
for i in range(0, df1.shape[0]):  # .shape[0] is the number of rows
    # create marker for location i
    markerObj = folium.Marker(location=[lats[i], longs[i]],
                              tooltip=names[i],
                              icon=folium.Icon(color="blue",
                                               icon='bicycle', prefix='fa'))
    # add marker to map
    markerObj.add_to(mapObj)
# display map
mapObj
After visualizing all stations on a map and reflecting on the data exploration steps above, it is clear that a good approach would be to use this dataset for unsupervised learning.
Unsupervised learning is a type of machine learning algorithm that learns patterns from untagged data. The goal is to find relationships within the data and group data points based on the input features.
Using unsupervised ML we could address questions such as:
Are there differences or similarities between Dublin Bike stations?
Is it possible to cluster the stations based on mean usage?
How many groups would we have?
Now that I have defined a target, let's get started with data preparation and feature engineering.
I have two similar columns that refer to times, both carrying similar information with small differences of around 3 or 4 minutes. I am going to choose LAST UPDATED as the main date-time feature because it gives the more accurate picture of the exact date and time.
However, previous steps showed this column loaded as an object type, so I am going to convert it to datetime.
# Creating LAST_UPDATED as a datetime copy of LAST UPDATED.
df['LAST_UPDATED'] = pd.to_datetime(df['LAST UPDATED'])
Extracting features from the date-time column can be useful, since we might see big differences in usage at weekends.
It is likely that I am searching for days which could be considered outliers, being distant from the average usage.
To make this analysis I am going to create two new columns: DAY_NUMBER (starting at 0 = Monday) and DAY_TYPE (Weekday, Saturday or Sunday).
# Series.dt.dayofweek The day of the week with Monday=0, Sunday=6.
df['DAY_NUMBER'] = df.LAST_UPDATED.dt.dayofweek
# Adding a day-type column: Weekday, Saturday or Sunday
df['DAY_TYPE'] = np.where(df['DAY_NUMBER'] <= 4, 'Weekday', (np.where(df['DAY_NUMBER'] == 5, 'Saturday', 'Sunday')))
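The nested np.where acts as a vectorized if/elif/else; a quick illustration on hypothetical day numbers:

```python
import numpy as np

days = np.array([0, 4, 5, 6])  # hypothetical DAY_NUMBER values
labels = np.where(days <= 4, 'Weekday',
                  np.where(days == 5, 'Saturday', 'Sunday'))
print(labels.tolist())  # ['Weekday', 'Weekday', 'Saturday', 'Sunday']
```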
As the update times differ across observations, it is important to create a precise, standard time so the stations can be compared.
This should also improve the distribution, since the earlier EDA confirmed that LAST_UPDATED has a high cardinality: 1311282 distinct values.
To do that I am going to create a new feature called TIME_ROUNDED_10_MIN based on LAST_UPDATED, then extract only the rounded time from it into a feature called NEW_TIME.
# New column with the time rounded to the nearest 10 minutes
df["TIME_ROUNDED_10_MIN"] = df["LAST_UPDATED"].dt.round('10min')
# Keeping only the time component
df['NEW_TIME'] = df["TIME_ROUNDED_10_MIN"].dt.time
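To illustrate how dt.round behaves, two hypothetical timestamps either side of the five-minute mark:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2021-04-01 00:04:59", "2021-04-01 00:05:01"]))
rounded = ts.dt.round("10min")
print(rounded.tolist())  # 00:04:59 rounds down to 00:00:00, 00:05:01 up to 00:10:00
```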
Because each station has a different number of bike stands, I am going to create a new variable called PERCENTAGE_OCCUPANCY to enable a comparison of the situation of each station across the timeline.
This puts all stations on the same scale so they can be compared fairly.
#creating important feature Occupancy %
df['PERCENTAGE_OCCUPANCY'] = df['AVAILABLE BIKES'] / df['BIKE STANDS']
Data cleansing is also an important step. As I have many redundant columns not useful for my exercise I am going to drop them.
# Removing redundant columns
df_clean = df.drop(['STATION ID','TIME','LAST_UPDATED','ADDRESS','LATITUDE','LONGITUDE','STATUS','LAST UPDATED'], axis=1, inplace=False)
df_clean.head(5)
| | NAME | BIKE STANDS | AVAILABLE BIKE STANDS | AVAILABLE BIKES | DAY_NUMBER | DAY_TYPE | TIME_ROUNDED_10_MIN | NEW_TIME | PERCENTAGE_OCCUPANCY |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BLESSINGTON STREET | 20 | 6 | 14 | 2 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 1 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 2 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 3 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.7 |
| 4 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.7 |
At this stage it is important to mention that, in its current format, my observations are the specific events that took place at each station in a very specific time frame.
To perform some analyses it is important to reshape the data, adjusting the observations to what I am interested in analyzing.
For the first exercise I am going to perform hierarchical clustering to check which days of the week are similar to each other and see whether clusters among the different days emerge.
To do that I am going to pivot my dataframe, setting DAY_NUMBER as the observations, NEW_TIME as the attributes, and populating the values with the mean of PERCENTAGE_OCCUPANCY.
# Using hierarchy to cluster the observations
#importing the library
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
#recovering matplotlib defaults
#plt.rcParams.update(plt.rcParamsDefault)
#Importing scipy.cluster.hierarchy to create dendrogram
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(
    df_clean.pivot_table(
        "PERCENTAGE_OCCUPANCY", "NEW_TIME", "DAY_NUMBER",
        aggfunc="mean").rename_axis(columns=None).T,
    method='ward'))
#creating the plot, assigning labels and title
plt.title('Dendrogram')
plt.xlabel('Week Days')
plt.ylabel('Euclidean distances')
# Adding horizontal lines to help visualize the cut distances.
plt.axhline(y=0.068, c='red', lw=1, linestyle='dashed')
plt.axhline(y=0.05, c='grey', lw=1, linestyle='dashed')
plt.show()
A dendrogram represents the relationships between objects.
It displays the distance between each pair of sequentially merged objects in a feature space.
Dendrograms are commonly used to study hierarchical clusters before deciding on the appropriate number of clusters for a dataset.
The plot above confirms that the weekend days (Saturday = 5 and Sunday = 6) have similar patterns.
We can also see that Tuesdays (1) and Thursdays (3) are similar.
Wednesdays (2), Fridays (4) and Mondays (0) are also similar.
A common approach is to analyze the dendrogram and look for groups that merge at a larger dendrogram distance.
The grey dashed line crosses three groups and suggests a distance of about 0.01 (from 0.05 to 0.06).
The red dashed line crosses two groups and suggests a distance greater than 0.09 (from 0.07 to 0.16).
For this reason, two clusters (weekends and workdays) are the most appropriate.
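Once a cut distance has been chosen, scipy can assign the cluster labels directly with fcluster; a sketch on hypothetical one-dimensional points:

```python
import scipy.cluster.hierarchy as sch

# Hypothetical points forming two well-separated groups
points = [[0.10], [0.15], [0.20], [4.90], [5.00], [5.20]]

linkage_matrix = sch.linkage(points, method='ward')
# criterion='distance' cuts the dendrogram at the given height t
labels = sch.fcluster(linkage_matrix, t=2.0, criterion='distance')
print(labels)  # two cluster labels, one per point
```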
An easy way to visualize the trend differences is to plot a line chart.
Using line charts to represent time series is generally accepted practice; the individual dots are frequently omitted altogether.
# Plotting a time series in a line chart
# Importing library
import plotly_express as px
# Creating the plot (computing the pivot table once and reusing it)
day_type_pivot = df_clean.pivot_table(
    "PERCENTAGE_OCCUPANCY", "NEW_TIME", "DAY_TYPE",
    aggfunc="mean").rename_axis(columns=None)
fig = px.line(day_type_pivot,
              x=day_type_pivot.index,
              y=day_type_pivot.columns,
              title="Dublin Bike Stations - Mean Occupancy per Hour by Day Type")
# Set custom plot
fig.update_layout(
legend_title = "Day Type",
title_font_color="gray",
legend_title_font_color="black",
xaxis_title= 'Time',
yaxis_title= "Occupancy %")
# Set custom x-axis labels
fig.update_xaxes(
ticktext=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00",
"12:00:00","14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
tickvals=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00","12:00:00",
"14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
)
fig.update_xaxes(tickangle=90)
fig.show()
The chart above confirms that weekends have trends and patterns completely different from workdays.
For that reason I am treating those two days as outliers relative to the workdays, and I am going to exclude them from the next exercise by creating a new dataframe with DAY_TYPE = Weekday.
# Creating a new dataframe with only weekdays, Monday to Friday
new_df= df_clean[df_clean["DAY_TYPE"] == "Weekday"]
new_df.head(5)
| | NAME | BIKE STANDS | AVAILABLE BIKE STANDS | AVAILABLE BIKES | DAY_NUMBER | DAY_TYPE | TIME_ROUNDED_10_MIN | NEW_TIME | PERCENTAGE_OCCUPANCY |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BLESSINGTON STREET | 20 | 6 | 14 | 2 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 1 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 2 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.7 |
| 3 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.7 |
| 4 | BLESSINGTON STREET | 20 | 6 | 14 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.7 |
As previously explained, the current observations are the specific events that took place at each station in a very specific time frame.
To visualize the mean trend of each station across the time frame, I am going to reshape the observations with a pivot table, setting NEW_TIME as the observations, the station NAME as the attributes, and the mean PERCENTAGE_OCCUPANCY as the values.
I am aware this visualization will be very poor and extremely difficult to interpret, but that is exactly the point I am trying to prove.
With 109 different stations it is impossible for the human eye to find trends and patterns just by looking at this line chart.
# Importing library
import plotly_express as px
station_pivot = new_df.pivot_table("PERCENTAGE_OCCUPANCY", "NEW_TIME", "NAME",
                                   aggfunc="mean").rename_axis(columns=None)
fig = px.line(station_pivot,
              x=station_pivot.index,
              y=station_pivot.columns,
              title="Dublin Bike Stations - Workdays Mean Occupancy % by Hour")
fig.update_layout(
legend_title = "Stations",
title_font_color="gray",
legend_title_font_color="black",
xaxis_title= 'Hour',
yaxis_title= "Occupancy %")
# Set custom x-axis labels
fig.update_xaxes(
ticktext=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00",
"12:00:00","14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
tickvals=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00","12:00:00",
"14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
)
fig.update_xaxes(tickangle=90)
fig.show()
As shown above, I need a better way to cluster those stations, so I am going to use unsupervised learning models to complete this task.
To perform this exercise I am going to reshape the observations with a pivot table, setting NAME as the observations (the stations are the objects I want to cluster), NEW_TIME as the attributes, and the mean PERCENTAGE_OCCUPANCY as the values.
#Creating a new Dataset
station_dataset = new_df.pivot_table("PERCENTAGE_OCCUPANCY","NAME", "NEW_TIME",aggfunc="mean").rename_axis(columns=None)
station_dataset.head(5)
| NAME | 00:00:00 | 00:10:00 | 00:20:00 | 00:30:00 | 00:40:00 | 00:50:00 | 01:00:00 | 01:10:00 | 01:20:00 | 01:30:00 | ... | 22:20:00 | 22:30:00 | 22:40:00 | 22:50:00 | 23:00:00 | 23:10:00 | 23:20:00 | 23:30:00 | 23:40:00 | 23:50:00 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AVONDALE ROAD | 0.396364 | 0.393462 | 0.390891 | 0.393462 | 0.392939 | 0.394922 | 0.393269 | 0.394423 | 0.393182 | 0.393893 | ... | 0.384659 | 0.381481 | 0.379198 | 0.381055 | 0.375564 | 0.377037 | 0.377820 | 0.380370 | 0.380709 | 0.379960 |
| BENSON STREET | 0.233906 | 0.240625 | 0.240820 | 0.241538 | 0.241477 | 0.242308 | 0.239205 | 0.236538 | 0.241031 | 0.239773 | ... | 0.239366 | 0.237868 | 0.224609 | 0.231298 | 0.227555 | 0.223837 | 0.221507 | 0.227539 | 0.223496 | 0.232863 |
| BLACKHALL PLACE | 0.646405 | 0.652116 | 0.657881 | 0.656923 | 0.654593 | 0.652344 | 0.655062 | 0.655556 | 0.665404 | 0.656250 | ... | 0.625191 | 0.614734 | 0.626768 | 0.626111 | 0.642058 | 0.656331 | 0.661578 | 0.662087 | 0.660652 | 0.664428 |
| BLESSINGTON STREET | 0.451829 | 0.456250 | 0.462403 | 0.462214 | 0.458594 | 0.459924 | 0.459615 | 0.458209 | 0.460150 | 0.455303 | ... | 0.454198 | 0.467279 | 0.459701 | 0.482377 | 0.467557 | 0.463258 | 0.476984 | 0.482197 | 0.466165 | 0.471805 |
| BOLTON STREET | 0.386538 | 0.381641 | 0.382946 | 0.392424 | 0.388976 | 0.387109 | 0.387594 | 0.386194 | 0.389474 | 0.385878 | ... | 0.398047 | 0.396538 | 0.401799 | 0.402692 | 0.394203 | 0.412016 | 0.404651 | 0.400376 | 0.400373 | 0.410938 |
5 rows × 144 columns
As expected, I ended up with 109 observations (my 109 stations) and 144 columns.
At this stage I decided to test whether dimensionality reduction is applicable to my new dataset without losing its properties.
The questions to be addressed are:
Is it possible to perform dimensionality reduction on this dataset without losing its properties?
How many components are needed to explain at least 90% of the variance of my station dataset?
To answer those questions I decided to apply PCA (Principal Component Analysis).
Principal component analysis (PCA) is an unsupervised learning method.
#importing the libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Using StandardScaler from scikit-learn to standardize the features onto unit scale (mean = 0, standard deviation = 1).
# This is a requirement for the optimal performance of many machine learning algorithms, including PCA.
data_scaled = StandardScaler().fit_transform(station_dataset)
# How many components are needed to explain 90% of the variance ?
pca = PCA(.90)
# fit and transform our dataset
fit = pca.fit_transform(data_scaled)
#print the number of components needed
pca.n_components_
2
# How much of the variance each component can explain ?
ratios = pca.explained_variance_ratio_
ratios
array([0.68027642, 0.23904475])
We can see that the first component explains about 68% of the variance and the second about 24%; together they explain roughly 92%.
# keep the first two principal components of the data
pca = PCA(n_components=2)
# fit PCA model
principalComponents = pca.fit(data_scaled)
# transform data onto the first two principal components
data_pca = pca.transform(data_scaled)
#printing shape results
print("Original shape: {}".format(str(station_dataset.shape)))
print("Reduced shape: {}".format(str(data_pca.shape)))
Original shape: (109, 144) Reduced shape: (109, 2)
I am going to create a new dataset called df2 containing the two principal components for each observation.
df2 = pd.DataFrame(data_pca)
df2.columns = ['First_Comp','Second_Comp']
Let's plot our principal components to see if it's possible to identify the clusters.
# Importing Library
import seaborn as sns
import matplotlib.pyplot as plt
#creating a pairplot to visualise our Clusters
_ = sns.pairplot(df2);
#add overall title
_.fig.suptitle('PCA 2 Principal Components');
#move overall title up
_.fig.subplots_adjust(top=.9);
Although dimensionality reduction was possible, I am still not able to clearly identify the clusters.
For that reason I am going to use unsupervised machine learning models to complete this task.
For the next exercise I am going to try two algorithms (hierarchical/agglomerative clustering and KMeans), measuring and comparing the results with the silhouette score.
# Using hierarchy to cluster the observations
# importing the library
import matplotlib.pyplot as plt
#recovering matplotlib defaults
#plt.rcParams.update(plt.rcParamsDefault)
#Importing scipy.cluster.hierarchy to create dendrogram
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(data_pca, method = 'ward'))
#creating the plot, assigning labels and title
plt.title('Dendrogram')
plt.xlabel('Stations')
plt.ylabel('Euclidean distances')
# Add horizontal line.
plt.axhline(y=56, c='grey', lw=1, linestyle='dashed')
plt.axhline(y=75, c='red', lw=1, linestyle='dashed')
plt.show()
The plot above confirms that some stations have similar patterns.
As previously mentioned, a common approach is to analyze the dendrogram and look for groups that merge at a larger dendrogram distance.
The grey dashed line crosses three groups and suggests a distance of around 19 (from 56 to 75), where it touches the red line.
The red dashed line crosses two groups and suggests a distance of at least 35 (from 75 to 110).
For this reason, two is the most appropriate number of clusters.
# Including Cluster Labels
# importing the library
from sklearn.cluster import AgglomerativeClustering
hierarchy_cluster = AgglomerativeClustering(n_clusters=2, affinity="euclidean", linkage="ward")
# Creating a pandas series to be used later on
y1 = pd.Series(hierarchy_cluster.fit_predict(data_pca), name="Cluster Hierarquical")
I am going to create a new data frame assigning the hierarchical cluster label to each observation.
#creating a new dataframe
df2 = pd.DataFrame(data_pca)
df2.columns = ['First_Comp','Second_Comp']
#assigning the Hierarchy cluster labels for each observation
df2 = pd.concat([df2, y1], axis = 1)
df2.head(5)
| First_Comp | Second_Comp | Cluster Hierarquical | |
|---|---|---|---|
| 0 | -4.180715 | -5.461058 | 1 |
| 1 | -7.968315 | 1.147348 | 0 |
| 2 | 10.222895 | -5.804465 | 0 |
| 3 | -0.669812 | -4.198103 | 0 |
| 4 | -8.338002 | -8.573231 | 1 |
I am going to plot my new dataframe in order to visualize the groups.
# Plotting Hierarchy Clusters
#importing the library
import matplotlib.pyplot as plt
#creating the plot
_ = sns.pairplot(df2, hue = 'Cluster Hierarquical',palette=['b', 'r'])
#move overall title up
_.fig.subplots_adjust(top=.9)
#add overall title
_.fig.suptitle('Hierarchy Cluster');
I am going to attempt KMeans to see what results I can get from this model; KMeans also works with Euclidean distances.
However, unlike hierarchical clustering, the KMeans algorithm requires the number of clusters to be specified up front.
For that reason I am going to use the Elbow method to get insight into possible values for the parameter K.
# Plotting Elbow method
import matplotlib.pyplot as plt
#recovering matplotlib defaults
#plt.rcParams.update(plt.rcParamsDefault)
#Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(data_pca)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11),wcss,marker='o')
plt.title('The Elbow Method')
plt.annotate('Possible number of Clusters', xy=(6, 2500), xytext=(5.4, 8000), arrowprops=dict(facecolor='green'))
plt.annotate("",xy=(4, 4000), xytext=(5.2, 7900), arrowprops=dict(facecolor='green'))
plt.annotate("",xy=(5, 3000), xytext=(5.3, 7900), arrowprops=dict(facecolor='green'))
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
From the above plot I can see the line starts to flatten around 4 clusters, but 5 or 6 are also plausible elbows.
For this reason I am going to test these three candidate values for K and measure the Silhouette score for each.
#Importing and Fitting K-Means to dataset
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y2 = pd.Series(kmeans.fit_predict(data_pca), name="Cluster KMeans")
The Silhouette Coefficient, also known as the silhouette score, is a metric used to evaluate the quality of a clustering.
Its value ranges from -1 to 1:
1 means clusters are well apart from each other and clearly distinguished.
0 means clusters are indifferent; the distance between clusters is not significant.
-1 means observations have been assigned to the wrong clusters.
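The comparison of the three candidate K values can be sketched in a single loop. Synthetic blobs stand in for `data_pca` below, so the printed scores will differ from the notebook's; only the loop structure mirrors the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic 2-D data standing in for the PCA-reduced stations
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)

scores = {}
for k in (4, 5, 6):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")
    print(f"Silhouette Score KMeans for {k} clusters: {scores[k]:.3f}")
```

The K with the highest score is then chosen as the final number of clusters.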
#importing the library
from sklearn.metrics import silhouette_score
# Calculate Silhouette Score hierarchy
hierarchy_score = silhouette_score(data_pca, hierarchy_cluster.labels_, metric='euclidean')
# Print the score
print('Silhouette Score Hierarchy for 2 clusters: %.3f' % hierarchy_score)
# Calculate Silhouette Score kmeans (the KMeans cell above was re-run with
# n_clusters = 4, 5 and 6; the three resulting scores are reported below)
score = silhouette_score(data_pca, kmeans.labels_, metric='euclidean')
# Print the scores
print('Silhouette Score KMeans for 4 clusters : 0.395')
print('Silhouette Score KMeans for 5 clusters : 0.422')
print('Silhouette Score KMeans for 6 clusters : 0.380')
Silhouette Score Hierarchy for 2 clusters: 0.358 Silhouette Score KMeans for 4 clusters : 0.395 Silhouette Score KMeans for 5 clusters : 0.422 Silhouette Score KMeans for 6 clusters : 0.380
The Silhouette score test suggests KMeans with 5 clusters as the best option, with a higher score than hierarchical clustering with only 2 clusters.
I am going to run the model with 5 clusters and plot the results to see what they look like.
# Creating a new df using the two principal components
df3 = pd.DataFrame(data_pca)
df3.columns = ['First_Comp','Second_Comp']
# Using concat function to add KMeans Cluster Label
df3 = pd.concat([df3, y2], axis = 1)
df3.head(5)
| First_Comp | Second_Comp | Cluster KMeans | |
|---|---|---|---|
| 0 | -4.180715 | -5.461058 | 2 |
| 1 | -7.968315 | 1.147348 | 4 |
| 2 | 10.222895 | -5.804465 | 0 |
| 3 | -0.669812 | -4.198103 | 2 |
| 4 | -8.338002 | -8.573231 | 4 |
# Plotting KMeans Clusters
#Importing the library
import matplotlib.pyplot as plt
#creating a pairplot to visualise our Clusters
_ = sns.pairplot(df3, hue = 'Cluster KMeans',palette=['blue', 'red','green','orange','0']);
#add overall title
_.fig.suptitle('Kmeans Cluster');
#move overall title up
_.fig.subplots_adjust(top=.9);
As previously mentioned, KMeans with 5 clusters has a better score than hierarchical clustering with only 2 clusters.
For that reason I am going to proceed to the next steps using only KMeans.
#Creating a new dataFrame called stations with station names
stations = station_dataset.index.to_frame(index=False, name="NAME")
# Adding the KMeans cluster prediction column to the station dataset using concat
station_labels = pd.concat([stations,y2], axis = 1)
station_labels.head(5)
| NAME | Cluster KMeans | |
|---|---|---|
| 0 | AVONDALE ROAD | 2 |
| 1 | BENSON STREET | 4 |
| 2 | BLACKHALL PLACE | 0 |
| 3 | BLESSINGTON STREET | 2 |
| 4 | BOLTON STREET | 4 |
# Merging station_labels with df1 in order to add the cluster labels and re-plot our map
df4= pd.merge(station_labels, df1)
# Using np.select to add a new column carrying the colour of each KMeans cluster
# For consistency I am going to order the colours to match the pairplot
conditionlist = [
(df4['Cluster KMeans'] == 0) ,
(df4['Cluster KMeans'] == 1) ,
(df4['Cluster KMeans'] == 2) ,
(df4['Cluster KMeans'] == 3) ,
(df4['Cluster KMeans'] == 4)
]
choicelist = ['blue','red','green','orange','black']
df4['KMeans Color'] = np.select(conditionlist, choicelist, default='Not Specified')
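An equivalent, arguably simpler way to derive the colour column is `Series.map` with a dict. The toy frame below only mirrors the column names used above; the values are hypothetical.

```python
import pandas as pd

# toy frame mirroring the 'Cluster KMeans' column used above
toy = pd.DataFrame({"Cluster KMeans": [0, 1, 2, 3, 4, 2]})

# one dict replaces the condition list / choice list pair
cluster_colors = {0: "blue", 1: "red", 2: "green", 3: "orange", 4: "black"}
toy["KMeans Color"] = toy["Cluster KMeans"].map(cluster_colors)
print(toy["KMeans Color"].tolist())
# → ['blue', 'red', 'green', 'orange', 'black', 'green']
```

`np.select` is more general (arbitrary boolean conditions, explicit default), while `map` is the idiomatic choice when each label maps to exactly one value.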
df4.head(5)
| NAME | Cluster KMeans | LATITUDE | LONGITUDE | KMeans Color | |
|---|---|---|---|---|---|
| 0 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green |
| 1 | BENSON STREET | 4 | 53.344154 | -6.233451 | black |
| 2 | BLACKHALL PLACE | 0 | 53.348801 | -6.281637 | blue |
| 3 | BLESSINGTON STREET | 2 | 53.356770 | -6.268140 | green |
| 4 | BOLTON STREET | 4 | 53.351181 | -6.269859 | black |
I am going to check how many stations each cluster has.
#checking cluster size
df4['KMeans Color'].value_counts(ascending=False)
green 37 black 24 blue 19 orange 17 red 12 Name: KMeans Color, dtype: int64
#Re-plotting Station Map by KMean Cluster
# import folium
import folium
# create a base map centered around Dublin
mapObj = folium.Map(location=[df4.LATITUDE.mean(), df4.LONGITUDE.mean()], zoom_start=13.5, control_scale=True)
#creating lists to store each observation values separately
longs = df4["LONGITUDE"]
lats = df4["LATITUDE"]
names = df4["NAME"]
colors = df4["KMeans Color"]
# create marker object for Dublin, one by one for every location in data DataFrame df4
for i in range(0,df4.shape[0]): # .shape[0] for Pandas DataFrame is the number of rows
# create marker for location i
markerObj = folium.Marker(location =
[lats[i],
longs[i]],
tooltip=names[i],
icon=folium.Icon(color=colors[i],
icon='bicycle', prefix='fa'))
# add marker to map
markerObj.add_to(mapObj)
# display map
mapObj
## Adding labels to the new_df
df5= pd.merge(df4,new_df, on="NAME", how= "left")
df5.head(5)
| NAME | Cluster KMeans | LATITUDE | LONGITUDE | KMeans Color | BIKE STANDS | AVAILABLE BIKE STANDS | AVAILABLE BIKES | DAY_NUMBER | DAY_TYPE | TIME_ROUNDED_10_MIN | NEW_TIME | PERCENTAGE_OCCUPANCY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green | 40 | 15 | 25 | 2 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.625 |
| 1 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green | 40 | 15 | 25 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.625 |
| 2 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green | 40 | 15 | 25 | 3 | Weekday | 2021-04-01 00:00:00 | 00:00:00 | 0.625 |
| 3 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green | 40 | 15 | 25 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.625 |
| 4 | AVONDALE ROAD | 2 | 53.359406 | -6.276142 | green | 40 | 15 | 25 | 3 | Weekday | 2021-04-01 00:10:00 | 00:10:00 | 0.625 |
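The behaviour of the left join above can be illustrated on two toy frames (hypothetical values; only the NAME key mirrors the real data):

```python
import pandas as pd

left = pd.DataFrame({"NAME": ["AVONDALE ROAD", "BENSON STREET"],
                     "Cluster KMeans": [2, 4]})
right = pd.DataFrame({"NAME": ["AVONDALE ROAD", "AVONDALE ROAD"],
                      "PERCENTAGE_OCCUPANCY": [0.625, 0.650]})

# how="left" keeps every row of `left`: matching NAMEs are duplicated
# once per match, and NAMEs missing from `right` get NaN
merged = pd.merge(left, right, on="NAME", how="left")
print(merged.shape)  # (3, 3): two matches for AVONDALE ROAD + unmatched BENSON STREET
```

This is why df5 has one row per station observation while still carrying the per-station cluster label and colour.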
Following the same principle, I want to visualize the mean trend of each cluster across the time frame.
I am going to reshape my observations with a pivot table, using NEW_TIME as the index, the KMeans cluster colour as the columns, and the mean PERCENTAGE_OCCUPANCY as the values.
This time I hope the visualization will be better and easier to interpret: with only 5 groups, the human eye can pick out trends and patterns in the line chart.
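The reshaping step can be sketched on a toy frame (hypothetical values; the column names match the ones used below):

```python
import pandas as pd

toy = pd.DataFrame({
    "NEW_TIME": ["00:00", "00:00", "00:10", "00:10"],
    "KMeans Color": ["blue", "green", "blue", "green"],
    "PERCENTAGE_OCCUPANCY": [0.60, 0.40, 0.70, 0.50],
})

# one row per time step, one column per cluster, cell = mean occupancy
pivot = toy.pivot_table("PERCENTAGE_OCCUPANCY", "NEW_TIME", "KMeans Color",
                        aggfunc="mean")
print(pivot)
```

Each resulting column is one line in the chart, so the five clusters can be compared directly over the day.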
#Plot Clusters by Mean Occupancy %
import plotly.express as px
# build the pivot once: mean occupancy per time step and cluster colour
occupancy = df5.pivot_table("PERCENTAGE_OCCUPANCY", "NEW_TIME", "KMeans Color",
                            aggfunc="mean").rename_axis(columns=None)
fig = px.line(occupancy,
              x=occupancy.index,
              y=occupancy.columns,
              color_discrete_map={
                  "blue": 'rgb(55,126,184)',
                  "green": 'rgb(77,174,74)',
                  "red": 'rgb(228,26,28)',
                  "orange": 'rgb(255,127,0)',
                  "black": '#222A2A'},
              title="Dublin Bike Stations Cluster by Mean Occupancy % ")
fig.update_layout(
legend_title = "Cluster",
title_font_color="gray",
legend_title_font_color="black",
xaxis_title= 'Hour',
yaxis_title= "Occupancy %"
)
# Set custom x-axis labels
fig.update_xaxes(
ticktext=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00",
"12:00:00","14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
tickvals=["01:00:00", "04:00:00", "06:00:00", "08:00:00","10:00:00","12:00:00",
"14:00:00","16:00:00","18:00:00","20:00:00","23:00:00"],
)
fig.update_xaxes(tickangle=90)
fig.show()
Upon review, I found that the CRISP-DM framework is an excellent tool for keeping focus on the tasks at hand, and I will seek to make use of it in future projects.
Prior to signing off on this project, reassessments were made to ensure that all proposals and objectives initially addressed were met, and that all questions were answered appropriately with sufficient evidence and rationale for the decisions made.
It is evident that Data Science applied in any dataset can help us to better understand the world and make better decisions as human beings, with a mindset of recognising the impacts for future generations.
Some points to highlight regarding this assessment are the importance of:
• having a good methodology.
• having the right tools available.
• discipline for reading, researching, and developing the necessary skills and knowledge in Programming, Statistics, Data Preparation, Machine Learning and Data Visualization.
With the project findings presented, it is clear that further research could be done to take this topic forward. There are clearly some correlations between the different groups of Dublin Bike stations, and perhaps opportunities for a better distribution of the bikes among the stations to improve the service provided. This might become a topic for future discussions.
Agresti, A. and Kateri, M. (2021). Foundations of Statistics for Data Scientists: With R and Python. First ed. Boca Raton: CRC Press.
Bhardwaj, A. (2020). Silhouette Coefficient : Validating clustering techniques. [online] Medium. Available at: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c
Blackmist (n.d.). Train and deploy a reinforcement learning model (preview) - Azure Machine Learning. [online] docs.microsoft.com. Available at: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-reinforcement-learning
Breslin, R. (2020). What Dublin Bikes data can tell us about the city and its people. [online] Medium. Available at: https://towardsdatascience.com/what-dublin-bikes-data-can-tell-us-about-the-city-and-its-people-63fde77ee383
Wilke, C.O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. Sebastopol, CA: O'Reilly Media.
Chen, C.-H. et al. (2016). Handbook of Data Visualization. Berlin: Springer.
Connors, L. (2021). Creating a Simple Map with Folium and Python. [online] Medium. Available at: https://towardsdatascience.com/creating-a-simple-map-with-folium-and-python-4c083abfff94
Das, A. (2020). Hierarchical Clustering in Python using Dendrogram and Cophenetic Correlation. [online] Medium. Available at: https://towardsdatascience.com/hierarchical-clustering-in-python-using-dendrogram-and-cophenetic-correlation-8d41a08f7eab
data.smartdublin.ie. (n.d.). Dublinbikes DCC - data.smartdublin.ie. [online] Available at: https://data.smartdublin.ie/dataset/analyze/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab
dataprep.ai. (n.d.). DataPrep — The easiest way to prepare data in Python. [online] Available at: https://dataprep.ai/
fontawesome.com. (n.d.). Font Awesome. [online] Available at: https://fontawesome.com/icons?d=gallery
Galarnyk, M. (2017). PCA using Python (scikit-learn). [online] Medium. Available at: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
GeeksforGeeks. (2021). What is Data Visualization and Why is It Important? [online] Available at: https://www.geeksforgeeks.org/what-is-data-visualization-and-why-is-it-important/#:~:text=Data%20visualization%20is%20very%20critical%20to%20market%20research
Grus, J. (2021). Data Science from Scratch: First Principles with Python. Second ed. O'Reilly.
IBM Cloud Education (2020a). What is Exploratory Data Analysis? [online] www.ibm.com. Available at: https://www.ibm.com/cloud/learn/exploratory-data-analysis.
IBM Cloud Education (2020b). What is Machine Learning? [online] www.ibm.com. Available at: https://www.ibm.com/cloud/learn/machine-learning
James (2017). Usage patterns of Dublin Bikes stations. [online] Medium. Available at: https://towardsdatascience.com/usage-patterns-of-dublin-bikes-stations-484bdd9c5b9e
Jeffares, A. (2019). How I used Machine Learning to improve my Dublin Bikes transit. [online] Medium. Available at: https://towardsdatascience.com/how-i-used-machine-learning-to-improve-my-dublin-bikes-transit-b6bdc7c2b5cb
Jeffares, A. (2021a). Data Science Nanodegree. [online] GitHub. Available at: https://github.com/alanjeffares/data-science-nanodegree/blob/master/dublin-bikes-analysis/get_nearest_available_bike.py
Jeffares, A. (2021b). Data Science Nanodegree. [online] GitHub. Available at: https://github.com/alanjeffares/data-science-nanodegree/blob/master/dublin-bikes-analysis/data_load_processing_viz.ipynb
Lawlor, J. (2021). dublin-bikes-timeseries-analysis. [online] GitHub. Available at: https://github.com/jameslawlor/dublin-bikes-timeseries-analysis
Martinez, J.C. (2021). How to plot your data on maps using Python and Folium. [online] livecodestream.dev. Available at: https://livecodestream.dev/post/how-to-plot-your-data-on-maps-using-python-and-folium/
McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Second ed. Sebastopol, CA: O'Reilly Media.
Müller, A.C. and Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. Beijing: O'Reilly.
Nichani, P. (2020). OutLiers in Machine Learning. [online] Analytics Vidhya. Available at: https://medium.com/analytics-vidhya/outliers-in-machine-learning-e830b2bd8660#:~:text=Outlier%20is%20an%20observation%20that%20appears%20far%20away
plotlygraphs (2019). Line Charts. [online] plotly.com. Available at: https://plotly.com/python/line-charts/
rachelbreslin (2022). dublin_bikes/Dublin Bikes Analysis.ipynb at main · rachelbreslin/dublin_bikes. [online] GitHub. Available at: https://github.com/rachelbreslin/dublin_bikes/blob/main/Dublin%20Bikes%20Analysis.ipynb
Ranjan, A. (2020). Hierarchical Clustering (Agglomerative). [online] Analytics Vidhya. Available at: https://medium.com/analytics-vidhya/hierarchical-clustering-agglomerative-f6906d440981
scikit-learn.org (n.d.). sklearn.metrics.adjusted_rand_score — scikit-learn 0.23.1 documentation. [online] scikit-learn.org. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
Scikit-learn.org. (2019). sklearn.preprocessing.StandardScaler — scikit-learn 0.21.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Summerfield, M. (2010). Programming in Python 3 : a complete introduction to the Python language. Second ed. Upper Saddle River, New Jersey: Addison-Wesley.
Weiss, N.A. (2017). Introductory Statistics. 10th ed. Pearson Education.
Wilke, C.O. (n.d.). Fundamentals of Data Visualization. [online] clauswilke.com. Available at: https://clauswilke.com/dataviz/directory-of-visualizations.html.
www.dublinbikes.ie. (n.d.). DublinBikes. [online] Available at: https://www.dublinbikes.ie/
www.ibm.com. (n.d.). CRISP-DM Help Overview. [online] Available at: https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=dm-crisp-help-overview.
www.youtube.com. (2018). PyData Dublin: Usage patterns of Dublin Bikes stations - James Lawlor. [online] Available at: https://www.youtube.com/watch?v=59ck_Z75cEY